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Abstract Gene trees are evolutionary trees representing the ancestry of genes sam- 
pled from multiple populations. Species trees represent populations of individuals — 
each with many genes — splitting into new populations or species. The coalescent pro- 
cess, which models ancestry of gene copies within populations, is often used to model 
the probability distribution of gene trees given a fixed species tree. This multispecies 
coalescent model provides a framework for phylogeneticists to infer species trees from 
gene trees using maximum likelihood or Bayesian approaches. Because the coalescent 
models a branching process over time, all trees are typically assumed to be rooted in 
this setting. Often, however, gene trees inferred by traditional phylogenetic methods 
are unrooted. 

We investigate probabilities of unrooted gene trees under the multispecies coa- 
lescent model. We show that when there are four species with one gene sampled per 
species, the distribution of unrooted gene tree topologies identifies the unrooted species 
tree topology and some, but not all, information in the species tree edges (branch 
lengths). The location of the root on the species tree is not identifiable in this sit- 
uation. However, for 5 or more species with one gene sampled per species, we show 
that the distribution of unrooted gene tree topologies identifies the rooted species tree 
topology and all its internal branch lengths. The length of any pendant branch leading 



E. S. Allman 

Department of Mathematics and Statistics, University of Alaska Fairbanks, 
PO Box 756660, Fairbanks, AK 99775 USA 
E-mail: e.allman@alaska.edu 

Corresponding author: 
J. H. Degnan 

Department of Mathematics and Statistics, University of Canterbury 
Private Bag 4800, Christchurch, New Zealand 
E-mail: J.Degnan@math.canterbury.ac.nz 

J. A. Rhodes 

Department of Mathematics and Statistics, University of Alaska Fairbanks, 
PO Box 756660, Fairbanks, AK 99775 USA 
E-mail: j.rhodes@alaska.edu 



2 



to a leaf of the species tree is also identifiable for any species from which more than 
one gene is sampled. 

Keywords Multispecies coalescent ■ phylogenetics ■ invariants ■ polytomy 
Mathematics Subject Classification (2000) 62P10 ■ 92D15 



1 Introduction 



The goal of a phylogenetic study is often to infer an evolutionary tree depicting the 
history of speciation events that lead to a currently extant set of taxa. In these species 
trees, speciation events are idealized as populations instantaneously diverging into two 
populations that no longer exchange genes. Such trees are often estimated indirectly, 
from DNA sequences for orthologous genes from the extant species. A common as- 
sumption has been that such an inferred gene tree has a high probability of having 
the same topology as the species tree. Recently, however, increasing attention has been 
given to population genetic issues that lead to differences between gene and species 
trees, and how potentially discordant trees for many genes might be utilized in species 
tree inference. 

Methods that infer gene trees, such as maximum likelihood (ML) using standard 
DNA substitution models, typically can estimate the expected number of mutations 
on the edges of a tree, but not the direction of time. Phylogenetic methods therefore 
often estimate unrooted gene trees. In many cases, the root of a tree can be inferred 
by including data on an outgroup, i.e., a species believed to be less cl osely rel ated 
to the species of interest than any of those are to each other ( Jennings and Edwards! 
I2OO5I : |Poe and Chubbll2004l : iRokas et afcood ') . However, outgroup species which are too 
distantly related to the ingroup taxa may lead t o unreliable inference , and in some cases 
appr opriate outgroup species are not known ( Graham et all |2002| : iHuelsenbeck et al 
I2OO2I '). The root of a gene tree can alternately be inferred under a molecular clock 
assumption, i.e., if mutation rates are constant throughout the edges of a tree. In many 
empirical studies, however, such a molecular clock assumption is violated. Furthermore, 
without a molecular clock, inferred branch lengths on gene trees may not directly reflect 
evolutionary time, as substitution rates vary from branch to branch. For these reasons, 
one may have more confidence in the inference of unrooted topological gene trees than 
in metric and/or rooted versions. 

Methods for inferring rooted species trees from multiple genes have been developed 
which make use of rooted gene trees, topological or metric, which possibly differ from 
that of the species tree. Most commonly, such methods assume that the incongruent 
gene trees (i.e., gene trees with topologies different from the species tree) arise because 
of incomplete lineage sorting, the phenomenon that the most recent common ancestor 
for two gene copies is more ancient than the most recent population ancestral to the 
species from which the genes were sampled. Examples are shown in Fig. [l^-g, in which 
the lineages sampled from species a and b do not coalesce in the population immediately 
ancestral to a and b. Several approaches for infer ring species trees in this setti ng have 
been proposed, such as min imizing deep coalesce jMaddison and Knowleslbood). BEST 
( Liu and Pear lll2007l'). ESP JCarstens and Knowlesll2007h . STEM ("Kubatko ct afeoool') , 
Maximum Tre e jLiu et allboiol') falso called the GLASS tree |Mossc1 and Roc h 20lB)), 
and *BEAST (|Heled and Drummondlboid '). The analysis of incomplete lineage sorting 
requires thinking of rooted trees (the idea of an event being "more ancient" requires 
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Fig. 1 The unrooted gene tree T15 in the species tree ((((a, 6), c), d), e). The seven distinct 
rooted gene trees depicted in (a)-(g) all correspond to the same unrooted gene tree T15 
shown in (h). The rooted gene trees in (c) and (d) can only occur for this species tree if all 
coalescences occur above the root, in the population with the lightest shading. The rooted 
gene trees in (a), (b), (e), (f), and (g) can occur with coalescent events either all above the 
root or with some event in the population immediately descended from the root. Only one 
coalescent scenario is shown for each of the rooted gene trees. 



that time have a direction), and is modeled probabilistically using; coalesc ent theory 
( Hudsonlflii^ : lKingman|[l983 : iNordbordlioOll : lTaiima|[l983l : IWakelevlliooi ') . 

The coalescent process was first developed to model ancestry of genes by a tree 
embedded within a single population, and uses exponential waiting times (going back- 
wards in time) until two lineages coalesce. By conceptualizing a species tree as a tree 
of connected populations (c/. Fig. [T]), each with its own coalescent process, the mul- 
tispecies coalescent can mode l probabilities of rooted gene tree topologies with in a 
rooted species tree dPegnan an d Rosenberg 2009: Dcg nan and Salter 200 5: Nei 19871 : 
IPamilo and Neilll988l : iRosenbergI |2002| : lTakahatalll989l ). Although much of the work 
of this area has focused on one gene lineage sampled per population, extensions to 
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computing gene tree probabilities when more than one lineage is sampled from each 
population has also been derived (|Degnanll2O10l : iRosenber dbooj iTakahatalflOsi l. 



Under the multispecies coalescent, the species tree is a parameter, consisting of a 
rooted tree topology with strictly positive edge weights (branch lengths) on all interior 
edges. Pendant edge weights are not specified when there is only one gene sampled per 
species, because it is not possible for coalescent events to occur on these edges. Rooted 
gene tree topologies are treated as a discrete random variable whose distribution is 
parameterized by the species tree, with a state space of size (2n — 3)!! = 1 x 3 x ■ ■ ■ x 
(2n — 3), the number of rooted, binary tree topologies ( Felsensteinll200'i ) for n extant 



species (leaves). (Nonbinary gene trees are not included in the sample space since the 
coalescent model assigns them probability zero.) 

Results on rooted triples (rooted topological trees obtained by considering subsets 
of three species) imply that the distribution of roo ted gene tree topologies identifies 
the rooted species tree topology (|Degnan et al|[2009h . in spite of the fact that the most 
likely n-ta xon gene tree topology need n ot have the same topology as the species tree 



for n > 3 ( Degnan and Rosenberg! |2006| ). Internal branch lengths on the species tree 



can also be recovered using probabilities of rooted triples from gene trees. In particular, 
for a 3-taxon species tree in which two species a and b are more closely related to each 
other than to c, let t denote the internal branch length. If p is the known probability 
that on a random rooted topological gene tree, genes sampled from species a and b 
are more closely related to each other than cither is to a gene sampled from c, then 
t = -log((3/2)(l - p)) (jNci,1987. : .Wakc lcv 20(M)- Thus, for each population (edge) 
e of the species tree, choosing two leaves whose most recent ancestral population is e 
and one leaf descended from the immediate parental node of e, the length of e can be 
determined. We summarize these results as: 

Proposition 1 For a species tree with ?i > 3 taxa, the probabilities of rooted triple 
gene tree topologies determine the species tree topology and internal branch lengths. 

Because the probability of any rooted triple is the probability that a rooted gene 
tree displays the triple, we have the following. 

Corollary 2 For a species tree with n > 3 taxa, the distribution of gene tree topologies 
determines the species tree topology and internal branch lengths. 

Although previous work on modeling gene trees under the coalescent has assumed 
that trees are rooted, the event that a particular unrooted topological gene tree is 
observed can be regarded as t he event that any of its rooted versions occurs at that locus 
(|Heled and Drummondll2010l '). For 71 species, there are (2n ^5)!! unrooted gene trees, 
and each unrooted gene tree can be realized by 2n — 3 rooted gene trees, corresponding 
to choices of an edge on which to place the root. The probability of an n-leaf unrooted 
gene tree is therefore the sum of 2n — 3 rooted gene tree probabilities, and the unrooted 
gene tree probabilities form a well-defined probability distribution. 

In this paper, we study aspects of the distribution of unrooted topological gene 
trees that arises under the multispecies coalescent model on a species tree, with the 
goal of understanding what one may hope to infer about the species tree. We find 
that when there are only four species, with one lineage sampled from each, the most 
likely unrooted gene tree topology has the same unrooted topology as the species tree; 
however, it is impossible to recover the rooted topology of the species tree, or all 
information about edge weights, from the distribution of gene trees. When there are 
5 or more species, the probability distribution on the unrooted gene tree topologies 
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identifies tlie rooted species tree and all internal edge weights. If multiple samples are 
taken from one of more species, then those pendant edge weights become identifiable, 
and the total number of taxa required for identifying the species tree can be reduced. 

In the main text, we derive these results assuming binary — fully resolved — species 
trees. However, the results generalize to nonbinary species trees, which have internal 
nodes of outdegree greater than or equal to 2. Details for nonbinary cases are given in 
Appendix [Cl Implications for data analysis will be discussed in Section (6] 

We briefly indicate our approach. Because the distribution of the (2n — 3)!! (rooted) 
or (2n — 5)!! (unrooted) gene trees is determined by the species tree topology and its 
n — 2 internal branch lengths, gene tree distributions are highly constrained under the 
multispecies coalescent model. Calculations show that many gene tree probabilities 
are necessarily equal, or satisfy more elaborate polynomial constraints. Polynomials in 
gene tree probabilities which evaluate to for any set of branch lengths on a particular 
species tree topology are called invariants of the gene tree distribution for that species 
tree topology. A trivial example, valid for any species tree, is that the sum of all gene 
tree probabilities minus 1 equals 0. Many other invariants express ties in gene tree 
probabilities. For example, consider the rooted species tree ((a,6),c), where t is the 
length of the internal branch. Suppose gene A is sampled from species a, B from b, and 
C from c. Then the rooted gene tree {{A, B), C) has probability pi — 1 — (2/3) exp(— f) 
under the coalescent, while the two altern ative gene trees, {{A, C), B) and {{B, C), A), 
have probability P2 = P3 = (1/3) exp( -t) (jNeill 19871 ). Thus a rooted gene tree invariant 



for this species tree is 

P2-P3=0. (1) 

We emphasize that this invariant holds for all values of the branch length t. The species 
tree also implies certain inequalities in the gene tree distribution; for example, for any 
branch length t > 0, pi > P2- Because of such inequalities, the invariant in equation ([T)) 
holds on a gene tree distribution if, and only if, the species tree has topology ((a, b), c). 

Different species tree topologies imply different sets of invariants and inequalities 
for their gene tree distributions, for both rooted and unrooted gen e trees. We note that 
previous work on invariants for statistical models in phylogenetics ( AUman and Rhode j 



I2OO3I : ICavender and FelsensteinI 19871 : lLakell987l ') has focused on polynomial constraints 
for site pattern probabilities; that is, probabilities that leaves of a gene tree display 
various states (e.g., one of four states for DNA nucleotides) under models of charac- 
ter change, given the topology and branch lengths of the gene tree. These approaches 
have been particularly useful in determining i dentifiability of (gene) trees given se- 
quence data under d ifferent models of mutation ( AUman and Rhode j|200^ : I2OO8I : l2009l : 
lAUman et a]|2010aH bh. 

In this paper, our methods focus on understanding linear invariants and inequalities 
for distributions of unrooted gene tree topologies under the multispecies coalescent 
model. Here gene trees are branching patterns representing ancestry and descent for 
genetic lineages, and are independent of mutations that may have arisen on these 
lineages. This is therefore a novel application of invariants in phylogenetics. 



2 Notation 



Let X denote a set of |X| = n taxa, and let ip denote a rooted, binary, topological 
species tree whose n leaves are labeled by the elements of X. If tp'^ is further endowed 
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with a collection A of strictly positive edge lengths for the n — 2 internal edges, then 
= (■i^''', A"*") denotes a rooted, binary, leaf-labeled, metric species tree on X. Note 
that edge lengths in the species tree do not represent evolutionary time directly, but 
are in coalescent units, that is units of r/Ne, where r is the number of generations and 
A^e is the effective population size, the effective number of gene copies in a population 
( Degnan and Rosenbersll2009l '). As pendant edge lengths do not affect the probability 
of observing any topological gene tree, rooted or unrooted, under the multispecies 
coalescent model with one individual sampled from each taxon, they are not specified 
in A^. To specify a particular species tree , we use a modified Newick notation which 
omits pendant edge lengths. For instance, a particular 4-taxon balanced metric species 
tree is = ((a, b):0.1, (c, d):0.05). Rooted 4- and 5-taxon species trees with branch 
lengths which will be used later in this paper are depicted in Fig. [51 We refer to the 
5-taxon tree shapes as balanced, caterpillar, and following Rosenberg, pseudocaterpillar. 




a bed a bed 




a b c d e a b c d e a b d e c 



balanced caterpillar pseudocaterpillar 

Fig. 2 Model species trees with branch lengths used to determine probabilities of unrooted 
gene trees in this paper. The two 4-taxon species trees in (a) and (b) each have the same 
unrooted topology, namely a tree with the ab\cd split. The three 5-taxon species trees in 
(c)-(e) also share one unrooted topology, the topology with the splits ab\cde and abc\de. 

Replacing ' + ' with ' — ' denotes suppressing the root, so that tp~ is the unrooted 
binary topological species tree, A~ the induced collection of n — 3 internal edge lengths 
on , and a~ — (■0", A~) is the unrooted metric species tree. An unrooted topology 
can be specified by its nontrivial splits — the partitions of the tazxa induced by removing 
an internal edge of the unrooted tree. For example, Tis in Fig. [T}i has splits BE\ACD 
and ABE\CD. A set of all taxa descended from a node in a rooted tree forms a clade, 
the rooted analog of a split. For example, the rooted gene tree in Fig. [T^ has 2-clades 
{B, E} and {C, D} and the 3-clade {A, B, E}. 
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For any set of taxa 5 C X, we let 7s denote the collection of all unrooted, binary, 
leaf-labeled topological gene trees for the taxa S. We use the convention that while 
lower-case letters denote taxa on a species tree, the corresponding upper-case letters 
are used as leaf labels on a gene tree; Thus A denotes a gene from taxon a, etc. For 
example, if X = {a, b, c, d}, then 

Tx = {AB\CD, AC\BD, AD\BC}. 

Given any sort of tree (species/gene, rooted/unrooted, topological/metric) on X, 
appending '(•S')' denotes the induced tree on the taxa S C X. By 'induced tree' here 
we mean the tree obtained by taking the minimal subtree with leaves in S and then 
suppressing all non-root nodes of degree 2. Instances of this notation include a~^{S), 
a-{S), V'+(S'), ip~{S), and r(S'). 



3 The multispecies coalescent model 

Several papers have given examples of applying the coalescent process to multiple 
species or populations to deri ve examples of pr obabilities of roo ted ge ne tree topologies 



given species trees (|Ne ] |l987l : IPamilo and NeilllQSSi : lRosenb crg"200 ^) with the genera l 



3 genera l 

case (for any n-taxon, rooted, binary species tree) given in (,Degnan and Salteiil20o"5l '). 
We present the model here with only one individual sampled per taxon, as that will 
be sufficient for our analysis. 

Under the multispecies coalescent model, waiting times (going backwards in time) 
until coalescent events (nodes in a rooted gene tree) are exponential random variables. 
The rate for these variables is Q), with i the number of lineages "entering" a pop- 
ulation, i.e., a branch on the species tree. Gene tree probabilities can be computed 
by enumerating all possible specifications of branches in which each coalescent event 
occurs, and computing the probability of these events in each branch, treating each 
branch as a separate population. In particular, the probability th at i lineages coalesce 
into j lineages within time t is represented by the function g^j (t) jTavardl 19841 ') ■ which 
is a linear combination of exponential functions: 

.,.>--±...{-{^^^ ...... 

(2) 

Here f > is time measured in coalescent units. The functions gij have the prop- 
erty that for any i > 1 and any t > 0, gij{t), j = is a discrete probabil- 
ity distribution, that for any i > 1, Vaat^oo guif) = 1, and that limj^o <?ii(i) = 1- 
These last two properties express the ideas that given enough time, all lineages even- 
tually coalesce (there is only one lineage remaining in a population) and that over very 
short time intervals, it is very likely that no coalescent events occur. Finally, note that 
gi,(t) =exp(-i(i- l)t/2). 

As an example of using this function to determine rooted gene tree probabilities, 
consider the rooted caterpillar species tree {{{{a,b):x,c):y,d):z,e) of Fig. [2Ji, and the 
rooted gene tree [[[[B,E),A),C), D). Since this gene tree requires a specific ordering 
of coalescences, and the first of these can only occur in the population above the 
root of the species tree, the only scenario to consider is that shown in Fig. [T]:;. In 
the population ancestral to species a and b, there are two lineages which must fail 
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to coalesce in time x, and this event has probabihty 322(2;) = exp(— x). Similarly, 
the events in the populations with durations y and 2 have probabilities exp(— 3j/) 
and exp(— 6z), respectively, because no lineages coalesce in those intervals. For the 
population ancestral to the root, all lineages eventually coalesce, and the probability 
for events in this population is the probability of observing the particular sequence of 

coalescence events, which is ((2) (2) (2) (2)) ~ 1/180. The probability of the rooted 
gene tree given the species tree is therefore exp(— a; — 3t/ — 62)/180. It is often convenient 
to work with transformed branch lengths, where if a branch has length x, we set 
X — exp(— X-). Using this notation, the rooted gene tree has probability XY^ /ISO. 

As another example, consider the gene tree {{{B,E),A),{C,D)) given the same 
species tree, {{{{a,b):x,c):y,d):z,e). For this rooted gene tree to be realized, either C 
and D coalesce as depicted in Fig. [1^, in the population immediately below the root 
(which we call the "near the root" population), or C and D coalesce above the root. 
Regardless, all other coalescent events must occur in the population above the root. We 
therefore divide the calculation of the rooted gene tree topology into these two cases. If 
all coalescent events occur above the root, the rooted gene tree probability is calculated 
as in the preceding paragraph, except that there are three possible orders in which the 
coalescent events could occur to realize the rooted gene tree, and the probability for 
this case is thus XY^Z^/ 60. In the case where C and D coalesce "near the root," there 
are no coalescent events in the populations with lengths x and y, thus contributing a 
factor of exp(— a; — 3y) to the probability. The probability for events near the root is 

(2) 943 (-2^)1 where the coefficient is the probability that of the four lineages entering 
the population, the two that coalesce are C and D. Because there are four lineages 
entering the population above the root of the species tree, the one sequence of coalescent 
events that results in the gene tree topology has probability ((2) (2) (2)) = 1/18. The 
total probability of the rooted gene tree topology (((B, E),A), (C, D)) given the species 
tree {{{a,b):x,c):y,d):z,e) is therefore 

11 
g22{x)g33(y)-j-g43{z)-j--pr—r- + g22{x)g33{y)94:i{z 



=XY^-(2Z^ - 2Z^)— + —XY^Z^ 
6^ n8 60 

=^XY^Z^ - —XY^Z^. 
54 540 

Probabilities of the other rooted gene trees in Fig. 1 can be worked out similarly by 
considering a small number of cases for each tree. Methods for enumerating all possible 
cases have been developed using the concept of coalescent his tory, a list of populations 



in which the coalescent events occur (jPeenan and Salterll2005h . Each coalescent history 
h has a probability of the form 

Ti-2 

c(h) J]^ 9j(h,&),j(h,b)(a:b) (3) 
b=l 

where Xf, is the length of internal edge b of the species tree and c(h) is a constant that 
depends on the coalescent history h and the topologies of the gene and species trees, 
but does not depend on the branch lengths Xf,. This expression is a linear combination 
of products of terms exp[— fc(fc — l)a;f,/2], fc = 2, . . . , n — 1, so using the transformations 
Xij — exp(— Xf,), probabilities of coalescent histories can thus be written as polynomials 
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in the transformed branch lengths of the species tree. Because gene tree probabilities 
are sums of probabilities of coalescent histories, gene tree probabilities can also be 
written as polynomials in the transformed branch lengths. 

Finally, unrooted gene tree probabilities, which are sums of rooted gene tree prob- 
abilities, can also be expressed as polynomials in the transformed branch lengths. We 
thus can derive polynomial expressions for the probabilities of unrooted gene trees 
given a species tree. 



4 Results 

The unrooted topological gene tree distribution under the multispecies coalescent 
model on species tree , with one lineage sampled per species, will be denoted by 
P = , so that P(r) denotes the probability of observing gene tree T £ Tx ■ 

For ease of exposition, we assume throughout this section that the species tree 
is binary. See Section [5] for the polytomous case. 



4.1 4-taxon trees 

We first consider the case of four taxa, and so let X — {a,b,c,d}. Using non-trivial 
splits as indices, the set of gene trees is 

Tx = {TAB\CDjTj^C\BD^Tj^jj\gQ}. 

With four taxa, there are only two shapes for species trees: the balanced tree, with 
two clades of size 2 (Fig. [2^); and the rooted caterpillar tree with a 2-clade nested inside 
a 3-clade (Fig.l^Js). Of the 15 possibilities for , there are three labeled balanced tree 
topologies, and 12 labeled caterpillar topologies. It is only necessary to compute gene 
tree probabilities for a single labeling of the leaves of each species tree shape, since 
permuting labels immediately gives the distribution for other choices. 

For a balanced tree a'^ — {{{a,b):x, {c, d):y) shown in Fig. one computes, as 
described in the previous section, that the gene tree distribution is given by 

r,ATAC\BD)=V,^iTAD\Bc) = ^e-'^^'+^l 

For a rooted caterpillar species tree a'^ — ({(a,b):x,c):y,d) shown in Fig. [^Ja, one 
finds 

r,ATAB\CD) = 'i-~le'", 
IP(t+(Tac|bd) = IF'cr+(rALi|Bc) = 3^ 

Thus for any 4-taxon species tree, from the distribution Pg.+ one can identify the 
unrooted species tree topology as that of the most probable unrooted gene tree T. 
The one internal edge length on tp~ [i.e., x+y in the balanced case, x for the caterpillar) 
can be recovered as — log (| (1 — P(r))) . Thus a~ — {ip~ , X~) is identifiable. 
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Furthermore a is not identifiable since the above calculations show that for any 
a; > 0, j/i > 0, and x > z > Q the following rooted species trees produce exactly the 
same unrooted gene tree distribution: 



{{a,h):x,c):yi,d), 
{{a,b):x,d):y2,c), 
((c, d):x,a):y3,b), 
((c, d):x,b):yi,a), 
((a, b):z, (c, d):x — z). 

We summarize this by: 

Proposition 3 For \X\ — 4 taxa, a~ is identifiable from but is not. 

We note that if the unrooted gene trees are ultrametric with known bran ch lengths, 
then their rooted topologies are known by midpoint rooting I Kim et "ailll993l ). and thus 
CT^ is identifiable from unrooted ultrametric 4-taxon gene trees. 



4.2 Linear invariants and inequalities for unrooted gene tree probabilities for 5-taxon 
species trees 

To establish identifiability of all parameters when there are at least 5 taxa, we will 
argue from the 5-taxon case. In this base case we will use an understanding of linear 
relationships — both equalities and inequalities — that hold between gene tree prob- 
abilities. The relationships that hold for a particular gene tree distribution reflect the 
species tree on which it arose. 

In this section, we determine all linear equations in gene tree probabilities for each 
of the three shapes of 5-leaf species trees. Following phylogenetic terminology, these are 
the linear invariants of the gene tree distribution. We emphasize that these invariants 
depend only on the rooted topology, ifj^ , of the species tree, and not on the branch 
lengths A"*". Although some of these invariants arise from symmetries of the species 
tree, others are less obvious. Nonetheless, we give simple arguments for all, and show 
that there are no others. In addition, we provide all pairwise inequalities of the form 
Ui > Uj for the three model species trees in Figs. [2j;-e. 

With X — {a, b, c, d, e}, there are 15 unrooted gene trees in Tx , which we enumerate 
in Table [5] of Appendix Probabilities for each of the 15 unrooted gene trees are 
obtained by summing probabilities of seven of the 105 rooted 5-taxon gene trees, as 
shown in Tables [4] and [5] of AppendixlX] In Appendix[B] formulas for the unrooted gene 
tree distribution are given for one choice of a leaf-labeling of each of the three possible 
rooted species tree shapes. Noticing that many of the gene tree probabilities are equal, 
one might hope that which ones are equal would be useful in identifying the species 
tree from the distribution. 

For each species tree one can computationally, but entirely rigorously, determine a 
basis for the vector space of all linear invariants. We report such a basis for each of the 
species tree shapes below, in Tables [T][3l Only for one of the tree shapes is an additional 
invariant that is not immediately noticeable produced by this cal culation. While ou r 
computations were performed using the algebra software Sing ular jCreuel et alll2009l '). 
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many other packages would work as well, or one could do the calculations without 
machine aid. 

In the tables and discussion below, we omit mention of the trivial invariant, 

15 

which holds for any choice of a^. We instead only give a basis for the homogeneous 
linear invariants. 

We use the following observation. 

Lemma 4 // all coalescent events occur above the root (temporally before the MRCA 
of all species) of a 5-taxon species tree, then all 15 of the unrooted topological gene trees 
are equally likely. 

Proof If all coalescent events occur above the root, then regardless of the species tree, 
we are considering five labeled lineages entering the ancestral population, and then 
coalescing. Because all unrooted gene trees have the same unlabeled shape, all coales- 
cent histories leading to one gene tree correspond to equally likely coalescent histories 
producing another, by simply relabeling lineages. □ 

Note that the claim of this lemma is special to five taxa. For six taxa, with two different 
unrooted gene tree shapes possible, the analogous statement is not true. 

4-2.1 Balanced species tree 

Suppose tp^ = (((o,6),c), (d, e)), as depicted in Fig. [5};. Because ct"^ is invariant under 
interchanging taxa a and b, any two gene trees that differ by transposing leaves A and 
B must have the same probability. Similarly, interchanging D and E on a gene tree 
cannot change its probability. We refer to the first permutation of labels using cycle 
notation as (ah), and the second as (de). More formally, assuming generic values for 
A"*", the symmetry group of cr^ is the 4-element group generated by the transpositions 
(ab) and (de), and the gene tree probability distribution must be invariant under the 
action of this group on gene trees. These symmetries thus give 'explanations' for many 
invariants holding. 

A different explanation for some invariants is that some unrooted gene trees can 
only be realized if all coalescent events occur above (more anciently than) the root of 
the species tree. For example, any realization of the gene tree T15 with splits BE\ACD 
and ABE\CD (Fig. [T}i) requires that the first (most recent) coalescent event either 
joins lineages B and E, or joins C and D. Because both of these events can only occur 
above the root, all events must take place above the root. Another such gene tree is Tn, 
with splits AE\BCD and ACE\BD. Thus by Lemma U the unrooted gene trees Tn 
and T15 must have the same probability, even though they do not differ by a symmetry 
as described in the last paragraph. We refer to this reasoning as the "above the root" 
argument. 

Some invariants can be explained in several ways. For example, the same invariant 
might be explained by two different symmetries or by both a symmetry and an above- 
the-root argument. In Table [TJ we list a basis for homogeneous linear invariants, and 
give only one explanation for each. Here Uj = P(Ti). 
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Table 1 Invariants for the rooted species tree V+ = {((a, b), c), (d, e)) 



Invariant Explanation 



Ml4 


- "15 


= 


(de) 


Mil 


— "15 


= 


above root 


"10 


— "15 


= 


{ah) 


Ug 


- «12 


= 


(de) 


US 


- "15 


= 


above root 


U-T 


- "15 


= 


{ab){de) 


U6 


- "12 


= 


{ab){de) 


«5 


- "12 


= 


(ab) 


li4 


- "13 


= 


(ab) 


U2 


— "3 = 


= 


{de) 



These equalities give the following equivalence classes of unrooted gene trees ac- 
cording to their probabilities: 

{Tl}, {T2,T3}, {r4,Ti3}, {T5,Te,Tg,Ti2}, {T7,Ts,Tio,Tii,Ti4,Ti5}. 

For any branch lengths on this species tree, we also observe the inequalities 

Ml > 112 , 114 > Its > ^7- (4) 

These inequalities were found by first expressing the probability of each Tj as a 
sum of positive terms corresponding to coalescent histories, such as expression Q, and 
then, by comparing coefficients in these sums, determining instances in which Ui > Uj 
must hold. Intuitively, this means that any realization of Tj corresponds to a realization 
of Tj, but that there are additional ways that Tj can be realized. 

The inequalities in can all be checked by elementary arguments using the explicit 
formulas of Appendix [B] For instance, since X,Y,Z£ (0, 1), 

ui-U2 = l--X-YZ+ -XYZ + -XY^Z =1-YZ ~ ix(4 - 3YZ - Y^Z) 
1 ^ g 2 6 6 ^ ' 

> 1 - - i(4 - 3YZ - Y^Z) = i - ^YZ{3 - Y^) 

>l- Iy{3 - > 0. 

In particular, there is always a 6-element equivalence class of trees which has the 
strictly smallest probability associated with it, and a 4-element class which has the 
next smallest probability associated to it. While the class associated to the largest 
probability is always a singleton, these inequalities do allow for the remaining two 
classes of size 2 to degenerate to a single class of size 4. 

Numerical examples can be used to show that there are no inequalities of the form 
Ui > Uj that hold for all branch lengths X,Y, and Z that are not listed in @. 

4-2.2 Caterpillar species tree 

Suppose = ((((a, 6), c), d), e), as depicted in Fig. [2ji. Then the symmetry group of 
the tree is generated by (ab), and has only two elements. 

Although no unrooted gene trees require that all coalescent events occur above 
the root of this species tree, there are gene trees that require that all events be either 
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above the root or "near the root" in the following sense. Consider the gene tree T15 
with splits BE\ACD and ABE\CD (Fig. [l]i). This gene tree can be realized either by 
all events occurring above the root (in which case either the BE coalescence or the CD 
coalescence could be first), or by 1, 2, or 3 events occurring in a specific order in the 
near-the-root population which is ancestral to species a, b, c, and d but not to e, with 
all further events above the root. For example, if there are two coalescent events in this 
population, then the gene tree must have {{CD) A) as a subtree (Fig. [T}3,e,f), and C 
and D must coalesce most recently followed by the coalescence of A. In case 1, 2, or 3 
events do occur below the root, these must be in the specific order 1) CD coalesce, 2) 
ACD coalesce, 3) ABCD coalesce. Another gene tree which leads to a similar analysis 
of how coalescent events must occur for the gene tree to be realized is T14, with splits 
BD\ACE, ABD\CE. Consequently, T14 has the same probability as Tis, even though 
these two gene trees do not differ by a symmetry. Similar arguments apply to trees T-j, 
Tg, Tio, and Tn. The near-the-root argument and symmetry between a and b explain 
all linear invariants but the last in Table [21 

Table 2 Invariants for the rooted species tree ^/i"*" = ((((a, b), c), d), e) 

Invariant Explanation 



Ui4 


- Ml5 = 


near root 


Mil 


- M15 = 


near root 


MlO 


- "15 = 


(ah) 


US 


- M15 = 


near root 


U7 


- M15 = 


near root 


Mg 


-Ug =0 


{ah) 


"5 


- Ui2 = 


{ab) 


M4 


- Mi3 = 


{ab) 


M2 — M3 


+ ug — U12 = 


marginalization 



To explain the last invariant in Table (2] we provide a marginalization argument. 
We use the fact that for 4-taxon trees the two unrooted gene trees that are inconsistent 
with the species tree are equiprobable. Thus, marginalizing over a to trees on {6, c, d, e}, 
we have that P(Tb£)|c_e) = '^{Tbe\cd)- Hence, 

"2 + Wg -f M7 -f Uii + lti4 = U3 -f lt5 -f Itg -f Uio + 1115. 

Because the last 3 terms on each side are equal to U15, we may cancel those. Replacing 
Ug with Mg, and W5 with U12, then gives the last invariant in the table. 

Table [5] yields the following equivalence classes of gene trees according to their 
probabilities: 

{Ti}, {T2}, {Ts}, {T4, Tia}, {n, Tis}, {Tg, Tg}, {Tj, Tg, Tio, Tn, Tm, T15}. 
We also observe that the inequalities 

Ui > U2,U4 > U5 > U7, 

ti3 > U2,ue > u^> (5) 

hold for all branch lengths on this species tree, and that there are no other inequalties 
of the form Uj > Uj that hold for all branch lengths, by arguments similar to those for 
the balanced tree. 
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4-2.3 Pseudocaterpillar species tree 

Suppose t/i^ = (((a, 6), (d, e)),c), as depicted in Fig. [5^. Then the symmetry group of 
the tree cr^ is generated by {ab) and (de), and has four elements. (Note that inter- 
changing the two cherries, for instance by (ad) (be), is a symmetry of ifj^ , but is not a 
symmetry of ct^ for generic edge lengths.) 

While no unrooted gene trees require that all coalescent events occur above the 
root of this species tree, some unrooted gene trees require that all events be either near 
the root or above the root. The gene tree Tis, with splits BE\ACD and ABE\CD, can 
be realized either by all events occurring above the root (in which case either the BE 
coalescence or the CD coalescence could be first), or by 1, 2, or 3 events occurring in 
a specific order in the population ancestral to species a, b, c, and d but not to e, with 
all further events occurring above the root. In case 1, 2, or 3 events do occur below the 
root, these must be in the specific order 1) BE coalesce, 2) ABE coalesce, 3) ABDE 
coalesce. Another gene tree which leads to a similar analysis of how coalescent events 
must occur for the gene tree to be realized is T12, with splits AE\BCD, ADE\BC . 
Thus T12 and Tis are equiprobable, even though they do not differ by a symmetry. 

A basis for homogeneous linear invariants of unrooted gene tree probabilities, along 
with explanations for each is given in Table O 

Table 3 Invariants for the rooted species tree ip+ = {({a, b), {d, e)), c) 

Invariant Explanation 



Ml4 


- "15 


= 


(de) 


U12 


— "15 


= 


near root 


"10 


- "15 


= 


{ah) 


UQ 


- "15 


= 


near root 


US 


- "11 


= 


{ah) 


U-T 


- "15 


= 


{ah)(de) 


U6 


- "15 


= 


near root 


"5 


- "15 


= 


near root 


U4 


- "13 


= 


{ah) 


U2 


— "3 = 


= 


{de) 



We thus obtain the following equivalence classes of unrooted gene trees according 
to their probabilities: 

{^1}, {T2,T^}, {r4,Ti3}, {TgjTii}, {r5,r6,T7,Tg,Tio,Ti2,Ti4,Ti5}. 

For all branch lengths on this species tree, we also observe the inequalities 

Ml > U2,Ui,Ua > U5 (6) 

and note that there are no other inequalities of the form Uj > Uj that hold for all 
possible branch lengths. In particular, the 8-element equivalence class of trees always 
has the strictly smallest probability associated with it. 

4.3 Species Tree Identifiability for 5 or more Tajca 

We will use several times the following observation, which is clear from the structure 
of the coalescent model. (In fact, this has already been used in Section 14.2.21 in the 
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marginalization argument explaining a linear invariant for the caterpillar tree.) While 
we state the lemma for unrooted gene trees, there is of course a similar statement for 
the distribution of rooted gene trees. 

Lemma 5 // S C X and T' £ Ts, then 

TeTx 
T{S) = T' 

As a consequence of the analysis for 4-taxon trees in Section 14.11 we obtain the 
following. 

Corollary 6 For any X , determines a~ . 

Proof We assume \X\ > 4, since otherwise there is nothing to prove. For any quartet 
Q C X of four distinct taxa, by Lemma [5] Pg.+ determines ¥^+(^Qy Thus (7~{Q) 
is determined by Proposition [3] Thus all unrooted quartet trees induced by are 
determined, along with their internal edge l engths. Tha t all induced quartet topologies 
determine the topology i/;" is well known (|Steellll992l l. Because each internal edge of 
ip~ is the internal edge for some induced quartet tree, A~ is determined as well. □ 

For the remaining arguments to determine cr^, we may assume that is already 
known. We focus first on the |X| = 5 case, and thus assume that X = {a, b, c, d, e} and 
that has non-trivial splits ab\cde and abc\de. 

Proposition 7 For |X| = 5 the rooted species tree topology is determined by . 

Proof From Section [4.21 for generic values of A"^, the caterpillar leads to seven distinct 
gene tree probabilities, with class sizes 1,1,1,2,2,2,6; the pseudocaterpillar gives five 
distinct probabilities, with class sizes 1,2,2,2,8; and the balanced tree gives five dis- 
tinct probabilities, with class sizes 1,2,2,4,6. Thus the (unlabeled) shape of ifj^ can be 
distinguished for generic edge lengths. However, for certain values of these parameters 
the classes can degenerate, by merging. 

To see that the tree shapes can be distinguished for all parameter values, observe 
that the inequalities Q-lIB} of Section 14.21 on gene tree probabilities ensures the class 
associated to the smallest probability always has size 8 for the pseudocaterpillar, while 
for the other shapes the size of this class is always 6. Moreover, for the caterpillar 
and balanced trees the size of the class associated to the second smallest probability 
must be exactly 2 and 4, respectively. Thus, these class sizes allow us to determine the 
unlabeled, rooted shape (balanced, caterpillar, or pseudocaterpillar) of the species tree. 
In addition, from Corollary |6l we also know the labeled, unrooted topology (i.e., the 
splits) of the species tree, ip~ . To determine the labeled, rooted topology, we consider 
cases depending on the unlabeled, rooted shape determined from the class sizes. 

If the species tree is balanced, from the splits we know that tp'^ — (((a, 6), c), (d, e)) 
or V"*" = {{a,b), (c, (rf,e))). But the gene tree Tt, with splits AD\BCE and ABD\CE, 
can be realized on the first of these species trees only if all coalescent events occur 
above the root; on the second species tree, can be realized other ways as well. Thus 

would fall into the 6-element class of least probable gene trees for the first but not 
the second species tree. This then determines i/j"*". 

For a caterpillar species tree, from the splits we know has as its unique 2- 
clade either {a,b} or {d,e}. By considering the cherries on the two gene trees in the 
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class of those with the second smallest probability, we see the 2-clade is determined as 
those tajca that appear in cherries with c. For simplicity, we henceforth suppose that 
the 2-clade is found to be {a,b}. Thus, Tp"^ = ((((a, 6), c), d), e) or ((((a, b), c), e), d). 
Then from the inequality © in Section we find that PiTs) > P(T2) if = 

((((a, fe), c), d), e), while this inequality is reversed if ip~^ = ((((a, 6), c), e), d). 

In the case of the pseudocaterpillar species tree, because the splits of tp" are known, 
there is only one possibility for -ip'^ . Thus is determined. □ 

Proposition 8 For \X\ = 5, Pg.+ determines a'^ = (?/>^, A'''). 

Proof By Proposition [7] , Tp'^ is determined. From Corollary [S] is also determined. 
Thus all elements of A^ except for the edges incident to the root are determined. In 
the balanced case, the sum of these two unknown edge lengths is determined, but 
in the other cases we have yet to determine any information about the single such 
non-pendant edge length. We therefore consider each of these cases in order. 

If Tp'^ is balanced, we may assume a"'" — {{{a,b):x,c):y),{d,e):z), with y,z still to 
be determined. As the unrooted internal edge length y + z is known, it is enough to 
determine y. From the gene tree probabilities in Appendix lB.il it follows that 

XYZ = 6u5 +9^7, 
XY^Z = 15u7. 

Thus, 

If -ip^ is a rooted caterpillar, we may assume = {{{{a,b):x,c):y,d):z,e). Only z 
remains to be determined. Using the explicit formulas for gene tree probabilities given 
in Appendix IB. 21 one checks that 

XY^ ^ 3{-U2 + U3 + 5u7) 
XY^Z^ = 15{u2 ~U3 + ur) 

and thus 

z = - log(Z) = - log (^^-—^-^ j . (8) 

If tp^ is the pseudocaterpillar, we may assume = {{{a,b):x, {d, e):y):z, c), with 
z still to be determined. The gene tree probabilities listed in Appendix IB. 31 show that 

XY = 12u5 + 3us 
XYZ^ = 30u5 - 15u8. 

Thus, 

. = -log(Z) = ilogf^^^i±^V (9) 

Note that equations Hlip - (|13p of Appendix |B] can be used to show that the arguments 
of the logarithms in equations ((Tl)-® are always strictly greater than 1. □ 
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While this proof used particular formulas to identify the remaining edge lengths in 
A^, note that many variants could have been used in their place. This simply reflects 
the many algebraic relationships (both linear and non-linear invariants) between the 
various gene tree probabilities. 

With the |X| = 5 case completed, we obtain the general result. 

Theorem 9 The unrooted topological gene tree distribution arising from the mult- 
species coalescent model for samples of one lineage per taxon determines the metric 
species tree provided \X\ > 5. // |X| = 4, Pg.+ determines only the unrooted metric 
species tree . 

Proof By Corollary (6] a" — (i/;^,A^) is determined. 

If \X\ > 5, consider a speciflc edge e of tp~ , and all 5-taxon subsets S C X such 
that the induced unrooted tree ?/'~(S') has e as an edge. If the root, p, of Tp'^ lies on 
e, then the root of tlj~^{S) is also p and thus the root of 'ip'^{S) lies on e for all such 
S. If p does not lie on e, then there exists an 5* with the root of il^^jS) not on e. To 
see this, for any set Q C X of four taxa which distinguishes e ( Steel|[l993 . Proposition 



6), choose a: G X \ Q so that the MRCA of S = Q U {x} is p. Then i^^iS) has root p, 
which is not on e. 

Thus using Lemma [5] and Proposition |8] to determine the root location of such 
tli~^{S) for each e, we can determine 4''^ . Then the length of any internal edges incident 
to the root of ip~^ can be recovered by choosing a 5-taxon subset S such that il'~^{S) has 
these edges, and applying Lemma [5] and Proposition [8] again. Thus is determined. 

Proposition |3] gives the case |X| = 4. □ 

Theorem [5] gives an alternate approach to establishing Corollary [21 in cases with 
\X\ > 5, since the distribution of rooted gene trees determines that of unrooted gene 
trees. 

Theorem [5] can also be used to show that if multiple lineages are sampled from 
some or all of the taxa, then the unrooted gene tree distribution contains additional 
information on the species tree, as follows. 

Corollary 10 Consider a species tree on taxon set X , and, for some l.i > 0, the 
distribution of unrooted topological gene trees under a multispecies coalescent model of 
samples of li individuals from taxon i. Suppose that either \X\> A and that there is at 
least one i such that li > 2, or that \X\ — 3 and that there are at least two values of 
i such that > 2. Then the gene tree distribution determines the species tree's rooted 
topology, internal edge lengths, and for any taxon with li > 1 the length of the pendant 
edge leading to taxon i. 

Proof We may assume all £i are either 1 or 2, by marginalizing over any additional 
individuals sampled, if necessary. 

Construct an extended species tree by attaching to any leaf i for which li — 2 a. 
pair of edges descending to new leaves labelled ii and 12, so the extended species tree 
has £ — X^^Li leaves. The pendant edge leading to taxon i in the original species 
tree becomes an internal edge on the extended tree, and retains its previous length. 
The lengths of the new pendant edges in the extended tree can be chosen arbitrarily, 
or left unspecified. Then a coalescent process on the extended £-taxon tree with one 
sample per leaf leads to exactly the same distribution of topological gene trees as the 
the multiple-sample process on the original species tree. 

Applying Theorem [9] to the extended tree, we obtain the result. □ 
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5 Nonbinary species trees 

The results for binary species trees generalize to nonbinary species trees as well. When 
species trees are allowed to be nonbinary, there are two unlabeled 3-tax on tree shapes , 
five unlabeled 4-taxon tree shapes, and 12 unlabeled 5-taxon tree shapes ( Cavlevl[l857l ). 



Probabilities of binary, unrooted gene tree topologies given a nonbinary species tree 
can be obtained by considering the limiting probability as one or more branch lengths 
go to zero in the formulas derived for binary species trees. We note that under the 
standard Kingman coalescent, gene trees, which depend on exponential waiting times, 
are still binary with probability 1 even when the species tree has polytomies. 

For the 3-taxon species tree, ((a, b):t, c), letting f — > 0, the rooted gene tree proba- 
bilities are each 1/3 in the limit. Thus the unresolved 3-tax;on rooted species tree can be 
identified from the gene tree distribution from the presence of three equal probabilities; 
whereas for a resolved species tree, exactly one gene tree has probability greater than 
1/3. Similarly, polytomies in any larger species tree can be identified by considering 
rooted triplets. A species tree node has three or more descendants if the three rooted 
gene trees obtained from sampling one gene from three distinct descendants of the node 
have equal probabilities. 

For 4-taxon species trees, the completely unresolved topology (a, b, c, d) can not 
be distinguished from the partially unresolved {{a,b,c):y,d) from unrooted gene tree 
probabilities as both result in equal probabilities of the three binary, unrooted gene 
trees on these taxa. Similarly, the resolved species trees {{{a,b):x,c):y,d) and the par- 
tially unresolved {{a,b):x,c,d) yield the same unrooted gene tree probabilities, with 
^iT+ 0'ab\Cd) = 1 ~ §^~^- These observations lead to the conclusion, as in the binary 
case, that 4-taxon unrooted gene tree probabilities identify the unrooted (possibly un- 
resolved) species tree, but do not identify the root. Thus Proposition [3] is still valid 
when is nonbinary. 

Identifiability of possibly-nonbinary rooted species trees for 5 or more taxa from 
probabilities of unrooted gene tree topologies can be established using arguments sim- 
ilar to those of the binary case. While we defer the detailed proofs to Appendix [C] we 
state these results as follows: 

Proposition 11 Proposition^^ Corollary \^ Propositions^ and\^ Theorem\^ and 
Corollary 1 1 0\ remain valid if is nonbinary. 

We note that a species tree with a polytomy is equivalent to a model of a resolved 
species tree with one or more branch lengths set equal to zero, and therefore that a 
resolved species tree and a polytomous species tree can be regarded as nested models. 
Although it might be difficult to distinguish polytomous versus resolved species trees 
from finite amounts of data, the nested relationship of these models suggests that 
likelihood ratios could be used to determine whether an estimated species tree branch 
length is significantly greater than 0. A previous study (|Poe and Chubbll2004l l argued 



for the hypothesis of a species-level polytomy in early bird evolution by using likelihood 
ratios to test whether gene trees had branch lengths significantly greater than at 
multiple loci. Since gene trees are theoretically expected to be resolved under the 
coalescent model, an alternative procedure would be to use probabilities of gene trees 
under the polytomous and resolved species trees and perform a likelihood ratio test for 
whether an estimated species tree branch length is significantly greater than 0. 
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6 Discussion 



Under standard models of sequence evolution, the distribution of site patterns of DNA 
does not depend on the position of the root of the g ene tree on which the sequence 
evolve (this is sometimes called the "pulley principle" ( FelsensteinlllQsil ) V Inference of 



the root of a gene tree requires additional assumptions, such as that of a molecular clock 
(mutation occurs at a constant rate throughout the tree), or inclusion of an outgroup 
taxon in the analysis, so that the root may be assumed to lie where the outgroup joins 
all other taxa in the study. We have shown, however, that under the coalescent model 
with five or more species, the distribution of unrooted topological gene trees preserves 
information about both the rooted species tree and its internal branch lengths. Thus 
in multilocus studies in which many gene trees are inferred, it is theoretically possible 
to infer the rooted metric species tree even in the absence of a molecular clock, known 
outgroups, or any metric information on the gene trees. While for some data sets it can 
be difficult to obtain either reliable roots or evolutionary times for branches of gene 
trees, these issues are not fundamental barriers to species tree inference. 

Although we have shown the theoretical possibility of identifying rooted species 
trees from unrooted gene trees by using linear invariants, we emphasize that we do 
not propose using these invariants as a basis for inference. Invariants of gene tree 
distributions are functions of their exact probabilities under the model — from finite 
data sets, gene trees are inferred with some error, and empirical estimates of gene 
tree probabilities from a finite number of gene trees might not satisfy invariants or 
inequalities that apply to the exact distribution. Moreover, many non-linear invariants 
which are not discussed in this paper (and not yet fully understood) further constrain 
the form of the gene tree distribution. 

In practice, very large numbers of loci might be needed to obtain approximate esti- 
mates of gene tree probabilities, and there must be considerable gene tree discordance 
in order to estimate probabilities of less pr obable unrooted g ene trees. For example, 
in an often-analyzed 106-gene yeast dataset ( Rokas et alll200^ . analyzing only the five 



species about which there is the most conflict, {S. cerevisiae, S. paradoxus, S. mikatae, 
S. kudriavzevil, and S. bayanus), yields the same unrooted gene tree for all 106 loci when 
inferred using maximum likelihood under the GTR -I- J" -I- 1 model without a molecular 
clock. If all observed gene trees have the same unrooted topology, then there is not 
enough information to infer the rooted species tree. Other data have shown more con - 
flict in unrooted gene trees, such as a 162-gene dataset for rice (|Cranston et a ] |2009l ). 



in which 99 of 105 rooted 5-taxon gene trees were represented in the Bayesian 95% 
highest posterior density (HDP) set of trees. 

If species tree branches are too long, it will not be possible to recover the rooted 
species tree from finite data. For example, if the species tree is {{{a,b):x,c):y,d):z,e), 
where y is sufficiently large, every observed gene tree (for a finite number of loci) might 
have the ABC\DE split. Being able to determine that e is the outgroup would require 
observing conflicting splits, such as that ABD\CE is more probable than ABE\CD. 
However, if y is large, these conflicting splits are likely to never be observed, making 
it difficult to distinguish between rooted topologies (((a, &), c), d), e), (((a, &), c), e), d), 
and (((a,6),c),(d,e)). 

On the other hand, if branches are too short, it might be difficult to distin- 
guish between certain rooted species trees, such as between (((a, &), c), (d, e)) and 
((a,6),(c, (d, e))) when the node immediately ancestral to c is very close to the root. 
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Further study would be needed for a precise understanding of how extreme brancli 
lengths affect the number of gene trees needed for reliable inference of the species tree. 
We note, however, that even when the rooted species tree cannot be fully inferred with 
great certainty, some rooted aspects of the tree might be recoverable. For example in 
the case of the rooted trees above, one might infer that (a, b) and (d, e) are rooted 
cherries on the tree, even if the placement of taxon c with respect to the root remains 
unknown. 

We again emphasize that invariants are not the most promising approach for infer- 
ring species trees from finite data, and that a maximum likelihood (ML) or Bayesian 
approach might be more appropriate. Given a set of sufficiently conflicting unrooted 
topological gene trees inferred by standard methods and then assumed to be correct, 
the rooted species tree could be inferred using ML, where the likelihood of the species 
tree is 

(2n-5)!! 

i=l 

where there are n taxa and the ith unrooted gene tree topology is observed times 
with Ui — N the total number of loci. The probability Ui of the ith gene tree 
depends on the species tree topology and branch lengths as outlined in Section O 
However, this 2-stage approach of gene tree inference followed by species tree inference 
does not take into account uncertainty in the gene trees, or cases in which inferred 
gene trees are not fully resolved. If there is not enough information in the sequences 
to estimate resolved gene trees, an approximation to equation (|10|l would be to either 
randomly resolve the tree if th ere are very many loci (as is oft en done in software 
i mplementing quartet puzzling (jStrimmer and von Haeseleij[l996l l or neighbor joining 
(|Saitou and Neilll987h l: or, if an unresolved gene tree has k resolutions, let the locus 
contribute a count of 1/k to each resolution. 

To better utilize the information in the unrooted gene trees, an attractive, but 
computationally more intensive, approach would use a Bayesian framework in which 
the posterior distribution of the rooted species tree is determined from posterior distri- 
butions of gene trees, thus taking into account uncertainty in the estimated gene trees. 
Cases in which ML would return an unresolved gene tree would likely correspond to a 
posterior distribution of gene trees with substantial support on more than one topol- 
ogy. Thus, instead of each locus contributing a count of one gene tree topology, it 
contributes fractional proportions to several topologies. In cases in which the gene tree 
distributions carry little information about the root of the species tree, the posterior 
distribution of the species tree would indicate this uncertainty by spreading the pos- 
terior mass over several species trees. The results of the present paper suggest that 
it is possible to extend current m odel-based me thods of inferring ro oted species trees 
(e.g., BEST (|Liu and Pearll |2007| '1 and STEM (|Kubatko et all 120091 ')') to cases where 
only unrooted gene trees can be estimated. 

Finally we note that invariants have a potenti al use in testing the fit of the multi- 
species coalescent model to a dataset. As noted in ( Slatkin and Pollackll20o5 ). processes 
such as population subdivision can lead to asymmetry in the probabilities of the two 
nonmatching rooted gene trees in the case of three taxa, thus violating the invariant 
in equation As shown in this paper, similar invariants can be obtained for larger 
number of species even when only unrooted gene trees are available, thus allowing the 
testing of the fit of the multispecies coalescent model in situations more general than 
the rooted 3-taxon setting. 
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A Tables for 5-taxon trees 



Table 4 The 105 rooted gene trees on 5 species. 
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Table 5 The 15 unrooted 5-taxon topological gene trees, as indicated by their non-trivial 
splits, and their probabilities Ui = P(Ti), where ri is the probability of the rooted gene tree 
Ri given the species tree cr+. 



Tree Splits Probability 
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B 5-taxon unrooted gene tree distributions 

B.l Balanced species tree 

For the 5-taxon balanced species tree of Fig. [2j;, 

f"*" = ({{<^, b):x, c):y, (d,e):z), 

let X = exp(— a;), Y = exp(— y), and Z = cxp(— z). Then the distribution of unrooted gene 
trees Ti is given by Ui = P^+(Ti) with 

2 2 1 1 •? 
ui = 1 - -X - -YZ + -XYZ + —XY^Z, 

3 3 3 15 

U2 = U3 = -YZ - -XYZ - —XY-^Z, 
3 6 10 

U4 = U13 = -X- -XYZ + —XY^Z, 
3 3 15 

U5=u(i = us= U12 = \xYZ - ^XY'^Z, 

D 10 

1 o 

U7 = US = uio = uii = ui4 = ui5 = —XY'^Z. (11) 



B.2 Rooted caterpillar species tree 

For the 5-taxon rooted caterpillar species tree of Fig. [2h. 

a'^ = {{{(a, b):x, c):y, d):z, e), 

let X = exp(— a;), Y = cxp(— j/), and Z = exp(— z). Then the distribution of unrooted gene 
trees Ti under the coalescent is given by Ui = ¥^+{Ti) with 

2 2 1 1 o 1 K 
ui = 1 ~ -X - -Y + -XY H XY-' H Xy^Z®, 

3 3 3 18 90 

U2 = -Y~ -XY - -XY^ + —XY-^Z^, 
3 6 9 90 

M3 = iy - ixy - —XY'-^ - — xy-'z^, 

3 6 18 45 

U4 = ni3 = ix - ixy -I- —XY-'' + —XY'^Z^, 
3 3 18 90 

= = ixy - ixy^ -f — xy^z'', 

6 9 90 

= «9 = ixy - — xy^ - — xy^z*^, 

6 18 45 

m = us = Jilo = Jill = M14 = M15 = ^^^'^ + TUT-^^^^''- (12) 



B.3 Pseudocaterpillar species tree 

For the 5-taxon pseudocaterpillar species tree of Fig. [2^, 

= (({a, b):x, (d, e):y):z, c), 

let X = exp(— a;), Y = exp(— i/), and Z = cxp(— z). Then the distribution of unrooted gene 
trees Ti is given by Ui = P^+(Ti) with 
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2 2 4 2 k 
uj = 1 ^ -X - -Y + -XY XYZ^, 

3 3 9 45 

U2=U3 = -Y - —XY + —XYZ'^, 
3 18 90 

n4 = ui3 = -X- —XY + —XYZ^, 
3 18 90 

1 1 fi 

"5 = «6 = = «9 = "10 = "12 = "14 = "15 = TZ^^ + 7^^'^^ i 

18 90 

us = uii = -XY - —XYZ'^. (13) 
9 45 ^ ^ 



C Nonbinary species trees 



Proof ( of Proposition The extension of Proposition [3] to nonbinary (j+ was discussed in 
Section [5] 

From this, for \X\ > 5 we know that for Q C X with |Q| = 4, the possibly unresolved 
unrooted quartet tree on Q can be determined from gene tree probabilities. Thus the unrooted, 
labeled species tree a~ can be det ermined by the identifiabi lity of (possibly nonbina ry) phylo- 
genetic trees from their quartets llBandelt and D ress 198^) llSemple and Steell l2003l . Theorem 
6.3.5), and thus Corollary |6] has been extended. 

Next, in addition to the three fully resolved rooted tree shapes on 5 taxa, we must consider 
the nine rooted shapes with polytomies. In Table [B] we designate these as Pi, . . . , Pg, specify 
an arbitrary labeling of the leaves of each, and list inequalities analogous to inequalities ||4}- 
ll6t for unrooted gene tree probabilities. The equivalence classes of labeled, binary, unrooted 
5-taxon gene trees associated with each polytomous species tree are given in Table [T] along 
with the gene tree probabilities as functions of transformed branch lengths X, Y , and Z. Gene 
tree probabilities are obtained from the equations for resolved trees in Appendix [B] by setting 
one or more branch lengths to 0. 

For the 5-taxon species tree shapes, in all cases of either resolved and polytomous trees, 
the least probable class, C, of gene trees always has probability strictly smaller than all others. 
There are five possible cases for the cardinality of C: 

1. |C| = 15: polytomy Pi 

2. |C| = 12: polytomy P2 or polytomy P3 

3. |C| = 10: polytomy P5 or polytomy P7 

4. \C\ = 8: resolved pseudocaterpillar 

5. |C| = 6: resolved caterpillar, resolved balanced, polytomy P4, polytomy Pe, polytomy Pg, 
or polytomy Pg 

If \C\ = 12, we can distinguish between P2 and P3 since all gene trees in the 3-element 
class for P3 have the same taxon not occurring in a cherry, while for P2 the gene trees in the 
3-element class have different taxa in this role. 

For |C| = 10, we can distinguish between polytomies P5 and P7 by considering the two 
2-element classes for both. For P5, both of the classes {T2,T3} and {r4,ri3} contain trees 
with one cherry in common. For P7, the gene trees in the classes {T2,T3} have a cherry in 
common, but those in {Tg,Tii} do not. Note that it is possible for these classes to degenerate, 
to form a 4-element class, but by counting the number of trees with a cherry in common in 
the larger degenerate class we can still determine whether the species tree shape is P5 or P7. 

If |C| = 6 the cardinality of the class with the second smallest probability determines the 
rooted tree shape in some cases. The class with the second smallest probability has cardinality 
2 only for the resolved caterpillar, cardinality 3 only for Pg, cardinality 6 only for polytomies 
P4 and Pg, and cardinality 4 for the resolved balanced tree and Pg. 

At this point we have determined the rooted unlabeled topology of the species tree from 
the 5-taxon gene tree classes, except for the P4 versus Pg case and the balanced versus Pg 
case. (We will return to these cases later.) 

For the fully resolved trees, Proposition|8]explains how we determine the labeling, so similar 
arguments are needed for each polytomous tree. If the species tree is Pi, there is nothing to 
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Table 6 Representatives for the 9 nonbinary, rooted 5-taxon species tree shapes, with in- 
equalities for gene tree probabilities. 



Species tree Newick representative inequalities 



species tree shape 



Pi 


(a, b, c, d, e) 




P2 


(a, 6, c, (rf, e):z) 


til > ^2 


Ps 


((a, b, c, d):z, e) 


U3 > «l 


Pi 


((a, b, c):y, d, e) 


Ml > tt2 > 117 


Ps 


((a, b):x, {d, e):y, c) 


Ui > U2,U4 > US 


P& 


(((a, b):x, c):y, d, e) 


Ml > 112, n4 > Ms > My 


P? 


((a, d, e):z, c) 


Ml > M2, Mg > M4 


Ps 


(((a, 6, c):3/, (d, e):z) 


Ml > M2 > M7 


P9 


(((a, fe, c):3/, d):z, e) 


Ml,M3 > M2 > M7 



do. For polytomies P2, P5, and P^/Ps the labeling on the unrooted tree determines that on 
the rooted one. 

If the species tree is P3, the taxon that appears in no cherry in the gene trees in the 
3-element class is the one that is an outgroup to all the others in the species tree. 

For polytomy P7, the resolved cherry in the species tree is determined by the unrooted 
labeled tree, and the outgroup is determined by not appearing in a cherry in the most probable 
gene tree. 

For polytomy Pg, the non-outgroup taxon which is not descended from the polytomy in 
the species tree is distinguished by not appearing in any cherry in the class with the second 
smallest probability. Calling this identified taxon d, the outgroup taxon is determined as the 
one appearing in a cherry with d in three of the six most probable trees (i.e., in three trees in 
the union of the two most probable classes, which may degenerate to a single class). 

Finally, the labeling on the balanced / Pa tree is determined as it was for the balanced tree 
in the proof of Proposition [7] 

At this point wc have determined the labeled rooted species tree topology 1/)+ (except for 
two Pi/Pg and balanced/Pg ambiguities). 

It remains to determine branch lengths on 1/)+. If the unlabeled species tree is any tree 
other than the resolved balanced tree, P4, Pg, or Pg, then the branch lengths can be solved 
from the system of equations listed for the given species tree from Appendix [B] or the formulas 
in Table El 

If the species tree is known to be cither P4 or Pg, we note that Pg degenerates to P4 as 
z (or Z 1). Solving the system of equations for the m^s in terms of the branch lengths 
for Pg, if Z < 1, then the species tree is Pg. If Z = 1, then the species tree is P4. Similarly, 
Pg is the limiting case of the balanced 5-taxon species tree as Z — > 1. Solving the system of 
equations for the balanced tree, the species tree is the balanced tree if Z < 1 and is Pg if 
Z = 1. 

Thus, for a 5-taxon species tree, even with polytomies, (t+ is identifiable. This extends 
Propositions [71 and [8l to potentially nonbinary species trees. Theorem |9l also extends, noting 
that if the root of the species tree has degree greater than 2, then its location will be identified 
by some 5-taxon subtree with the same property. The proof of Corollary 1101 did not use the 
assumption that (t+ is binary, so it applies to nonbinary species trees as well. □ 
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Table 7 Equivalence classes of equiprobable gene trees for each nonbinary 5-taxon species 
tree, and gone tree probabilities in terms of transformed branch lengths. T denotes the set of 
all 5-taxon binary unrooted gene trees, {T\,T2, ■ ■ ■ ,Ti5}. 



Species tree equivalence classes gene tree probabilities 



Pi 


T 


Ui 




1 

15 




{Ti,T4,Ti3} 


Ul 


= 


1 4 y 
3 ~~ 15 




r\{Ti,T4,Ti3} 


U2 






Pz 




Its 


= 


1 2 76 

9 ~ 45 




T\{T3,n,Tg} 


Ul 


= 


1 I 1 76 

18 90 


Pa 


{Ti,T4,Ti3} 


Ul 




1 1 V ^ 1 V3 
3 3 15 




{T2,T3,n,T6,Tg,Ti2} 


U2 




Iv 1 V3 
6 10 




{Ty, Ts, Tio, Til, Ti4, T15} 


U7 




1 v3 
15 


P5 


mi 


Ul 


= 


1- §x-§y + fxy 






U2 


= 


iy„^xy 




m,Ti3} 


U4 


= 


ix- Axy 




r\{Ti,T2,T3,T4,Ti3} 


U5 


= 




Pe 


{Ti} 


Ul 


= 


1 - fx - |y + ^XY + ^xy3 




{T2,T3} 


U2 




ly _ ixy- jLxyS 




{74,Ti3} 


U4 








{T5,T6,Tg,Ti2} 


U5 




|xy-^xy3 




{TV, Ts, Tio, Til, Ti4, T15} 


U7 


= 


^xy3 


P7 


mi 


Ul 


= 






m,T3} 


U2 




3 ~ T8^ + 90^^ 




m,Tii} 


Us 




9^ ~ 45^^ 




r\{Ti,T2,T3,T8,Tii} 


U4 




+ ^xz*^ 


P8 


{Ti,T4,Ti3} 


Ul 








{72,T3,T5,r6,T9,Ti2} 


U2 




lYZ-l^Y^Z 




{T7, T8, Tio, Til, Ti4, Tis} 


U7 




^Y^Z 


P9 


{Ti,T4,Ti3} 


Ul 




1 1 y 1 1 y3 i_ 1 y3 76 
3 3 ^ 18 "'"90 




{T2,T5,Ti2} 


U2 








{T3,T6,T9} 


113 








{T7,T8,Tio,Tii,Ti4,Ti5} 


U7 




^y3 + ^y326 



